Synopsis Generator Notebook¶

Table of Contents¶

Introduction

  1. Introduction
  2. Libraries, Config & APIs
  3. Scraping

Chapter 1 : Data Analysis & Visualization

  1. Section 1.1
  2. Section 1.2
  3. Section 1.3

Chapter 2 : Models

  1. Section 2.1 : Classification
  2. Section 2.2 : Generation
    1. Section 2.2.1 : Synopsis Generation
    2. Section 2.2.2 : Poster Generation

Chapter 3 : App

  1. Section 3.1 : All
  2. Section 3.2 : TV Shows
  3. Section 3.3 : Movies

Introduction & Libraries ¶

Introduction ¶

What is this?¶
  • This is a potential creative app that uses generative models as a playground for users to generate overviews of movies & TV shows. It's mostly an app to learn about using and fine-tuning Hugging Face models like BERT & GPT-2.
What is this notebook? Wouldn't a Python script be easier?¶
  • This is the notebook used to prepare everything for the app's Python backend. A notebook makes it easier to experiment, and it suits the exploratory data analysis we'll do first to get to know the data we have to work with. Once experimentation is done, we can extract the code into a lib, push it to a git repository and use it in the final demo & app.
Why this app/idea?¶
  • Because it lets me learn more about how Hugging Face is used to produce models for production. Hugging Face has become the face of NLP models, and being able to take a pre-existing model and fine-tune it to fit a need is a skill I wanted to train.

Libraries, Config & APIs ¶

This is where all the imports & config happen.

The biggest imports are :

Library                    What it's used for
Pandas & Numpy             Processing & handling of the data
Pyplot, Plotly & Seaborn   Visualizing the data
TensorFlow & Keras         Deep learning backend & framework
Transformers & Diffusers   Pre-trained NLP & diffusion models

We also need to configure the API key for TMDB to be able to scrape the database, and an HF token to be able to push fine-tuned models to the Hugging Face Hub.

In [ ]:
import os
import glob
import json
import requests
import calendar

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

import tensorflow as tf

from transformers import (
    AutoTokenizer, create_optimizer, AdamWeightDecay, DataCollatorWithPadding,
    TFAutoModelForSequenceClassification, TFAutoModelForCausalLM
)

from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from keras.callbacks import TensorBoard
import evaluate

from datasets import Dataset

from diffusers import DiffusionPipeline, LCMScheduler

HF_TOKEN = os.environ['HF_TOKEN']
API_KEY = os.environ["TMDB_API_KEY"]
API_VERSION = 3
API_BASE_URL = f"https://api.themoviedb.org/{API_VERSION}"
RANDOM_STATE = 21
2023-12-11 22:20:30.641197: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-11 22:20:30.877217: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-11 22:20:30.877256: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-11 22:20:30.878391: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-11 22:20:31.019250: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/alel/Projets/Python/SOD/Projet_Final/Final_Project/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
In [ ]:
import plotly
plotly.offline.init_notebook_mode()

Scraping TMDB ¶

Using TMDB's API, we can scrape the database with a function that cycles through every movie & TV show ID to get the details. Unfortunately there's no way around this short of choosing another database: TMDB doesn't offer dumps of its database, only an API service limited to 40 requests per second. This makes the scraping a very heavy process. You can get the script here

from scripts.scrapper import scrapper_tmdb

## Movies
scrapper_tmdb('./data/jsons/04_11_2023/ids/movie_ids_04_11_2023.json', './data/jsons/04_11_2023/movie', 10, 'movie', API_BASE_URL, API_KEY)

## TV Shows
scrapper_tmdb('./data/jsons/04_11_2023/ids/tv_series_ids_04_11_2023.json', './data/jsons/04_11_2023/tv', 10, 'tv', API_BASE_URL, API_KEY)
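The scraper itself lives in `scripts/scrapper.py` and isn't shown here. As a rough sketch of what such a rate-limited fetch loop can look like (the function name, the injectable `get` parameter and the URL layout are assumptions for illustration, not the real script):

```python
import time
import requests

def fetch_details(ids, media_type, api_base_url, api_key, get=requests.get, max_rps=40):
    """Fetch TMDB details for each id while staying under max_rps requests/s."""
    min_interval = 1.0 / max_rps  # 0.025 s between requests at 40 req/s
    results = []
    for media_id in ids:
        start = time.monotonic()
        url = f"{api_base_url}/{media_type}/{media_id}"
        resp = get(url, params={"api_key": api_key})
        if resp.status_code == 200:
            results.append(resp.json())
        # Sleep off whatever is left of this request's time slot
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results
```

The real script additionally writes the results into batched JSON files, which is what `get_df_from_json_folder` reads back below.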

Chapter 1 : Data Analysis & Visualization¶

Section 1.1 : Building the dataset ¶

After using the scraper, we're left with multiple JSON files that we can load with pandas using the functions defined below.

In [ ]:
def get_json(file):
    with open(file) as f:
        data = json.load(f)
    return data

def get_df_from_json_folder(filepath, cols=None):
    json_pattern = os.path.join(filepath, '*.json')
    list_of_files = glob.glob(json_pattern)

    list_of_files.sort(key=lambda x: int(
        os.path.splitext(x)[0].split('-')[1]
    ))
    
    dfs = [pd.DataFrame(get_json(file)).T for file in list_of_files]

    return pd.concat(dfs, ignore_index=True)[cols] if cols else pd.concat(dfs, ignore_index=True)
In [ ]:
series_columns = ['backdrop_path', 'created_by', 'episode_run_time', 'first_air_date', 'genres', 'in_production', 'languages', 'last_air_date', 'last_episode_to_air', 'name', 'networks', 'number_of_episodes', 'number_of_seasons', 'origin_country', 'original_language', 'original_name', 'overview', 'popularity', 'production_companies', 'production_countries', 'seasons', 'spoken_languages', 'status', 'tagline', 'type', 'vote_average', 'vote_count']

movies_columns = ['backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']

df_movies_raw = get_df_from_json_folder('./data/jsons/04_11_2023/movie', movies_columns)

df_series_raw = get_df_from_json_folder('./data/jsons/04_11_2023/tv', series_columns)

Section 1.2 : EDA ¶

Preprocessing the data for EDA: we start by filtering out columns that aren't useful for analysis and revealing where the NaN values are.

In [ ]:
def preprocess_df(df_raw, col_list, col_process_list):
    result_df = df_raw[col_list].copy()
    result_df = result_df.mask(result_df == '').mask(result_df.map(str).eq('[]'))

    for col in col_process_list:
        result_df[col] = result_df[col].apply(lambda row : [dicts['name'] for dicts in row] if type(row) == list else row)
        
    return result_df
In [ ]:
movies_eda_columns = ['belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']

movies_cols_to_process = ['genres', 'production_companies', 'production_countries']

df_movies = preprocess_df(df_movies_raw, movies_eda_columns, movies_cols_to_process)

df_movies['belongs_to_collection'] = df_movies['belongs_to_collection'].fillna(value=np.nan).apply(lambda row: row['name'] if type(row) == dict else row)
In [ ]:
series_eda_columns = ['created_by', 'episode_run_time', 'first_air_date', 'genres', 'in_production', 'languages', 'last_air_date', 'name', 'networks', 'number_of_episodes', 'number_of_seasons', 'origin_country', 'original_language', 'original_name', 'overview', 'popularity', 'production_companies', 'production_countries', 'seasons', 'spoken_languages', 'status', 'tagline', 'type', 'vote_average', 'vote_count']

series_cols_to_process = ['created_by', 'genres', 'networks', 'production_companies', 'production_countries', 'seasons']

df_series = preprocess_df(df_series_raw, series_eda_columns, series_cols_to_process)
In [ ]:
for df in [df_movies, df_series]:
    df['spoken_languages'] = df['spoken_languages'].apply(lambda row: [dicts['english_name'] for dicts in row] if type(row) == list else row)

We can now look at the data once it's been preprocessed.

In [ ]:
print(df_movies.shape)
df_movies.head(2)
(859971, 19)
Out[ ]:
belongs_to_collection budget genres homepage original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 Blondie Collection 0 [Comedy, Family] NaN en Blondie Blondie and Dagwood are about to celebrate the... 2.5 [Columbia Pictures] [United States of America] 1938-11-30 0 70 [English] Released The favorite comic strip of millions at last o... Blondie 7.2 7
1 NaN 0 [Adventure] NaN de Der Mann ohne Namen NaN 1.091 NaN [Germany] 1921-01-01 0 420 NaN Released NaN Peter Voss, Thief of Millions 0.0 0
In [ ]:
print(df_series.shape)
df_series.head(2)
(158626, 25)
Out[ ]:
created_by episode_run_time first_air_date genres in_production languages last_air_date name networks number_of_episodes ... popularity production_companies production_countries seasons spoken_languages status tagline type vote_average vote_count
0 NaN [60] 2004-01-12 [Drama] False [ja] 2004-03-22 Pride [Fuji TV] 11 ... 18.171 NaN [Japan] [Season 1] [Japanese] Ended NaN Scripted 8.2 13
1 [Kevin Smith, Scott Mosier, David Mandel] [22] 2000-05-31 [Animation, Comedy] False [en] 2002-12-22 Clerks [ABC, Comedy Central] 6 ... 31.201 [Touchstone Television, View Askew Productions... [United States of America] [Specials, Season 1] [English] Canceled NaN Scripted 7.012 86

2 rows × 25 columns

One important thing to look for is NaN values. We used the preprocessing step to distinguish NaNs from void values (like empty lists or strings), so we can now visualize them properly, for example with a bar chart or a heatmap.

In [ ]:
nan_plot_dict = {
    "Movies" : df_movies.isna().T,
    "TV Show" : df_series.isna().T
    }

plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('Missing data in DFs')
index = 1
for k,v in nan_plot_dict.items():
    plt.subplot(1,2,index)
    plt.title(k)
    sns.heatmap(v, cbar=False)
    index += 1
plt.tight_layout()
plt.savefig('./viz/nan_heatmap.png')
In [ ]:
nan_percentage_dict = {
    "Movies" : df_movies.isna().sum().sort_values(ascending=True) / len(df_movies) * 100,
    "TV Show" : df_series.isna().sum().sort_values(ascending=True) / len(df_series) * 100
    }

plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('NaN percentage in DFs')
index = 1
for k,v in nan_percentage_dict.items():
    plt.subplot(1,2,index)
    plt.title(k)
    v.plot(kind='barh')
    index += 1
plt.tight_layout()
plt.savefig('./viz/nan_percent.png')

Now we can see that some columns have a lot of missing data. The features we're interested in, for generating overviews & classifying genres, are overview & genres. TV shows are missing a lot of overviews & genres, and movies are missing somewhat fewer overviews. We can therefore see if we can train a model to infer those missing genres, since trying to generate convincing overviews based only on title & genres seems a bit too difficult.

We can now look at some general visualizations of the data, such as the languages and genres.

In [ ]:
plot_dict_movies = {
    "Original languages of movies" : 
    df_movies['original_language'].value_counts().head().sort_values(ascending=True),
    "Movies genres" :
    df_movies['genres'].value_counts().head().sort_values(ascending=True),
    "Original languages of tv shows" : 
    df_series['original_language'].value_counts().head().sort_values(ascending=True),
    "TV Shows genres" : 
    df_series['genres'].value_counts().head().sort_values(ascending=True)
}

plt.subplots(figsize=(15,10))
plt.axis('off')
index = 1
for key, value in plot_dict_movies.items():
    plt.subplot(2,2, index)
    plt.title(key)
    value.plot(kind='barh')
    index += 1
plt.savefig('./viz/languages_genres.png')
In [ ]:
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'])
df_series['first_air_date'] = pd.to_datetime(df_series['first_air_date'], errors='coerce')

df_movies['release_year'] = df_movies['release_date'].dt.year 
df_movies['release_month'] = df_movies['release_date'].dt.month 

df_series['first_air_year'] = df_series['first_air_date'].dt.year 
df_series['first_air_month'] = df_series['first_air_date'].dt.month

print('TV Shows Air Date NaNs :', (df_series['first_air_date'].dt.year).isna().sum(), '\nMovies Air Date NaNs :', (df_movies['release_date'].dt.year).isna().sum())
TV Shows Air Date NaNs : 30083 
Movies Air Date NaNs : 79628
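The `errors='coerce'` used for the TV show dates is what keeps the conversion from raising on malformed entries: invalid dates become NaT and show up in the NaN counts above. For example:

```python
import pandas as pd

# Invalid entries become NaT instead of raising a ValueError
dates = pd.to_datetime(pd.Series(['2004-01-12', 'not a date', '']), errors='coerce')
print(dates.isna().sum())  # → 2
```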
In [ ]:
plot_dict = {
    "TV Show" : df_series['first_air_date'].dt.year.value_counts().head(10), 
    "Movies" : df_movies['release_date'].dt.year.value_counts().head(10)
    }

plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('Most Release Per Year in DFs')
index = 1
for k,v in plot_dict.items():
    plt.subplot(1,2,index)
    plt.title(k)
    v.plot(kind='bar')
    index += 1
plt.savefig('./viz/release_year.png')
In [ ]:
start_year, end_year = 2019, 2023
year_list = [i for i in range(start_year, end_year + 1)]

plot_list = [[df_movies, 'release_date'], [df_series, 'first_air_date']]
title_list = ['Movies', 'TV Show']

def get_count_by_month(df, col, year):
    result = df[df[col].dt.year == year][col].dt.strftime("%b").value_counts().reindex(calendar.month_abbr[1:])
    return result

plt.subplots(figsize=(20,5))
plt.axis('off')
plt.suptitle('Release by month from :')
for index in range(2):
    plt.subplot(1, 2, index + 1)
    plt.title(title_list[index])
    for year in year_list:
        get_count_by_month(plot_list[index][0], plot_list[index][1], year).plot(marker='o')
    plt.legend(year_list[:])
plt.savefig('./viz/release_month.png')
In [ ]:
plots = [df_movies[df_movies['runtime'] != 0]['runtime'],
         df_series['episode_run_time']]

index = 1
plt.subplots(figsize=(15,5))
plt.axis('off')
for each in plots:
    plt.subplot(1,2,index)
    each.value_counts().head().plot(kind='barh')
    index += 1
plt.savefig('./viz/runtime.png')
In [ ]:
plot_dict = {
    "TV Show per countries" : df_series['production_countries'].explode().dropna().value_counts(), 
    "Movies per countries" : df_movies['production_countries'].explode().dropna().value_counts()
    }

for title, plot in plot_dict.items():
    fig = go.Figure(data=go.Choropleth(locationmode='country names',  locations=plot.index.values, text=plot.index, z=plot.values, colorscale = 'Greys'))
    fig.update_layout(height=600, width=800, title_text=title,title_x=0.5)
    fig.show()

Chapter 2 : Models ¶

The app has two principal functionalities that LLMs can help us with :

  • Classifying overviews into genres :
    • The user writes an overview and the model tells us which genre the overview fits into
  • Generating overviews from genres & title :
    • The user writes a title & selects genres, and the model gives us an overview that fits the title & genres

Section 2.1 : Classification ¶

To classify genres based on overviews, we'll fine-tune a DistilBERT model for sequence classification.

In [ ]:
cols = ['genres', 'overview', 'title']

df_clf = pd.concat([
    df_movies[df_movies['overview'].notna()],
    df_series[df_series['overview'].notna()].rename(columns={'name' : 'title'})
    ], ignore_index=True)[cols].rename(columns={'genres' : 'label'})
In [ ]:
def filter_dominant_genres(genres_list):
    dominants_genres_list = ['Drama', 'Documentary', 'Comedy', 'Animation', 'Horror']
    if len(genres_list) > 1 and genres_list[0] in dominants_genres_list:
        processed_genres = [genre for genre in genres_list if genre not in dominants_genres_list]
        result = processed_genres[0] if len(processed_genres) > 0 else genres_list[0]
        return result
    else:
        return genres_list[0]
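The idea is that very common genres like Drama or Comedy crowd out rarer, more informative ones, so we keep a dominant genre only when nothing more specific follows it. Repeating the definition above for a self-contained check:

```python
def filter_dominant_genres(genres_list):
    dominants_genres_list = ['Drama', 'Documentary', 'Comedy', 'Animation', 'Horror']
    if len(genres_list) > 1 and genres_list[0] in dominants_genres_list:
        processed = [g for g in genres_list if g not in dominants_genres_list]
        return processed[0] if processed else genres_list[0]
    return genres_list[0]

print(filter_dominant_genres(['Drama', 'Crime']))   # a rarer genre wins → Crime
print(filter_dominant_genres(['Comedy']))           # a lone genre is kept → Comedy
print(filter_dominant_genres(['Drama', 'Comedy']))  # only dominants → Drama
```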
In [ ]:
replace_dict = {
    'Sci-Fi & Fantasy' : 'Science Fiction',
    'War & Politics' : 'History',
    'Musical' : 'Music',
    'News' : 'Reality',
    'Talk' : 'Reality',
    'Soap' : 'Drama',
    'Kids' : 'Family',
    'Action & Adventure' : 'Adventure',
    'War' : 'History'
}

df_clf['label'] = df_clf['label'].apply(lambda row : filter_dominant_genres(row) if type(row) == list else row).replace(replace_dict)

df_clf['text'] = df_clf['title'] + ' | ' + df_clf['overview']

candidates = df_clf['label'].value_counts().index.to_list()
In [ ]:
df_clf_to_train = df_clf[~df_clf['label'].isna()].copy()
df_clf_to_train.shape
Out[ ]:
(571212, 4)
In [ ]:
genre_index = df_clf_to_train['label'].factorize()[1]

df_clf_to_train['label'] = df_clf_to_train['label'].factorize()[0]

df_clf_to_train.head()
Out[ ]:
label overview title text
0 0 Blondie and Dagwood are about to celebrate the... Blondie Blondie | Blondie and Dagwood are about to cel...
1 1 Love at Twenty unites five directors from five... Love at Twenty Love at Twenty | Love at Twenty unites five di...
3 0 Elmo is making a very, very super special surp... Sesame Street: Elmo Loves You! Sesame Street: Elmo Loves You! | Elmo is makin...
4 1 After the coal mine he works at closes and his... Ariel Ariel | After the coal mine he works at closes...
5 1 Nikander, a rubbish collector and would-be ent... Shadows in Paradise Shadows in Paradise | Nikander, a rubbish coll...
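`factorize` maps each distinct label to an integer code and returns the code-to-label index, which is what we later use to build `id2label`/`label2id`. On a toy series:

```python
import pandas as pd

labels = pd.Series(['Comedy', 'Drama', 'Comedy', 'Horror'])
codes, index = labels.factorize()
print(list(codes))  # [0, 1, 0, 2]
print(list(index))  # ['Comedy', 'Drama', 'Horror']
```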
In [ ]:
tokenizer_clf = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_classifier(examples):
    return tokenizer_clf(examples["text"], truncation=True)

cols = ['text', 'label']

raw_datasets = Dataset.from_pandas(df_clf_to_train[cols], preserve_index=True).train_test_split(seed=RANDOM_STATE)

tokenized_dataset = raw_datasets.map(tokenize_classifier, batched=True, remove_columns=['__index_level_0__'])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer_clf, return_tensors="tf")
Map:   0%|          | 0/428409 [00:00<?, ? examples/s]
Map:   0%|          | 0/142803 [00:00<?, ? examples/s]
In [ ]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return accuracy.compute(predictions=predictions, references=labels)
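`compute_metrics` reduces per-class logits to a class id before scoring; `evaluate`'s accuracy is just the fraction of matching predictions. The argmax step on toy logits (values invented):

```python
import numpy as np

logits = np.array([[0.1, 2.0, 0.3],   # predicted class 1
                   [1.5, 0.2, 0.1]])  # predicted class 0
labels = np.array([1, 1])

predictions = np.argmax(logits, axis=1)
accuracy = float(np.mean(predictions == labels))
print(list(predictions), accuracy)  # [1, 0] 0.5
```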
In [ ]:
n_label = len(genre_index)

id2label = {id:genre_index[id] for id in range(n_label)}
label2id = {genre_index[id]:id for id in range(n_label)}
In [ ]:
batch_size = 16
num_epochs = 3
batches_per_epoch = len(tokenized_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

model_clf = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=n_label, id2label=id2label, label2id=label2id
)

tf_train_set = model_clf.prepare_tf_dataset(
    tokenized_dataset["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator,
)

tf_validation_set = model_clf.prepare_tf_dataset(
    tokenized_dataset["test"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator,
)

model_clf.compile(optimizer=optimizer)

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

push_to_hub_callback = PushToHubCallback(
    output_dir="overview_classifier_final",
    tokenizer=tokenizer_clf,
)

tensorboard_callback = TensorBoard(log_dir="./overview_classifier_final/logs")

callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]

model_clf.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
2023-11-25 00:08:41.652008: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-11-25 00:08:41.953486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-11-25 00:08:41.953964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5606 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-11-25 00:08:46.371265: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: Permission denied
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/alel/Projets/Python/SOD/Projet_Final/overview_classifier_final is already a clone of https://huggingface.co/Alirani/overview_classifier_final. Make sure you pull the latest changes with `repo.git_pull()`.
Epoch 1/3
2023-11-25 00:08:58.792376: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1fea6190 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-25 00:08:58.792414: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3060 Ti, Compute Capability 8.6
2023-11-25 00:08:58.799206: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-25 00:08:58.850631: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8902
2023-11-25 00:08:58.910458: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
53551/53551 [==============================] - 6487s 121ms/step - loss: 1.4370 - val_loss: 1.3354 - accuracy: 0.5644
Epoch 2/3
53551/53551 [==============================] - 5988s 112ms/step - loss: 1.1724 - val_loss: 1.3251 - accuracy: 0.5701
Epoch 3/3
53551/53551 [==============================] - 5753s 107ms/step - loss: 1.1282 - val_loss: 1.3251 - accuracy: 0.5701
Out[ ]:
<keras.src.callbacks.History at 0x7f32a04d4df0>

Section 2.2 : Generation ¶

Section 2.2.1 : Synopsis Generation ¶

To generate synopses, we'll fine-tune a GPT-2 model on the title, genres and overview of each movie or TV show and check how it behaves. We'll use the classifier we just fine-tuned to predict the missing genres.

First we get a dataframe with only the rows missing genres

In [ ]:
df_clf_to_pred = df_clf[df_clf['label'].isna()].dropna(subset=['title']).copy()

df_clf_to_pred.shape
Out[ ]:
(230912, 4)

We load the model we just pushed to the Hugging Face Hub, so we're sure we always use the latest version, and we write a function that infers the genre of a title + overview input.

In [ ]:
def get_pred_genre(input, tokenizer, model):
    tokenized_input = tokenizer(input, max_length=512, truncation=True, padding='max_length', return_tensors="tf")
    logits = model(**tokenized_input).logits
    predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
    return model.config.id2label[predicted_class_id]
In [ ]:
df_clf_to_pred['label'] = df_clf_to_pred.apply(lambda row: get_pred_genre(row['text'], tokenizer_clf, model_clf), axis=1)
df_clf_to_pred['text'] = df_clf_to_pred['title'] + ' | ' + df_clf_to_pred['label'] + ' | ' + df_clf_to_pred['overview']

df_clf_to_train['label'] = df_clf_to_train['label'].replace(id2label)
df_clf_to_train['text'] = df_clf_to_train['title'] + ' | ' + df_clf_to_train['label'] + ' | ' + df_clf_to_train['overview']
In [ ]:
cols = ['genres', 'text']

df_gen = pd.concat([df_clf_to_train, df_clf_to_pred])
In [ ]:
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_gen = TFAutoModelForCausalLM.from_pretrained(model_name)
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
In [ ]:
def generate_synopsis(model, tokenizer, title):
    input_ids = tokenizer(title, return_tensors="tf")
    output = model.generate(input_ids['input_ids'], max_length=150, num_beams=5, no_repeat_ngram_size=2, top_k=50, attention_mask=input_ids['attention_mask'])
    synopsis = tokenizer.decode(output[0], skip_special_tokens=True)
    return synopsis

prompt = "Blondie | Family | "

print(f"Model output before fine-tuning: {generate_synopsis(model_gen, tokenizer, prompt)}\nWhat we're expecting : {df_gen.iloc[0]['text']}")
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Model output before fine-tuning: Blondie | Family | __________________
What we're expecting : Blondie | Family | Blondie and Dagwood are about to celebrate their fifth wedding anniversary but this happy occasion is marred when the bumbling Dagwood gets himself involved in a scheme that is promising financial ruin for the Bumstead family.

We can see the model's results aren't what we're expecting, but now we can fine-tune it and see if the output improves.

In [ ]:
def tokenize_generator(examples):
    return tokenizer(examples["text"])

raw_datasets = Dataset.from_pandas(df_gen, preserve_index=True).train_test_split(seed=RANDOM_STATE)

tokenized_datasets = raw_datasets.map(
    tokenize_generator, batched=True, remove_columns=['title', 'overview', 'text', 'label', '__index_level_0__']
)

def group_texts(examples, block_size = 128):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
Map:   0%|          | 0/601593 [00:00<?, ? examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 1024). Running this sequence through the model will result in indexing errors
Map:   0%|          | 0/200531 [00:00<?, ? examples/s]
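`group_texts` concatenates all the tokenized examples and re-chunks them into fixed-size blocks, dropping the remainder; for causal LM the labels are just a copy of the inputs (the shift happens inside the model). Rerunning it with a tiny `block_size` and invented token ids:

```python
def group_texts(examples, block_size=4):
    # Concatenate every sequence per key, then split into fixed-size blocks
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated[list(examples.keys())[0]]) // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()  # causal LM: labels == inputs
    return result

toy = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
out = group_texts(toy, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]] — the trailing 9 is dropped
```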
In [ ]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
)

optimizer = AdamWeightDecay(lr=2e-5, weight_decay_rate=0.01)

model_gen.compile(optimizer=optimizer)

train_set = model_gen.prepare_tf_dataset(
    lm_datasets["train"],
    shuffle=True,
    batch_size=8,
)

validation_set = model_gen.prepare_tf_dataset(
    lm_datasets["test"],
    shuffle=False,
    batch_size=8,
)

tensorboard_callback = TensorBoard(log_dir="./distilgpt2-finetuned-synopsis-genres_final/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./distilgpt2-finetuned-synopsis-genres_final",
    tokenizer=tokenizer
)

callbacks = [tensorboard_callback, push_to_hub_callback]

model_gen.fit(train_set, validation_data=validation_set, epochs=4, callbacks=callbacks)

model_gen.save_pretrained("./data/model/distilgpt2-finetuned-synopsis-genres_final")
Map:   0%|          | 0/601593 [00:00<?, ? examples/s]
Map:   0%|          | 0/200531 [00:00<?, ? examples/s]
/home/alel/.local/lib/python3.10/site-packages/keras/src/optimizers/legacy/adam.py:118: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super().__init__(name, **kwargs)
Cloning https://huggingface.co/Alirani/distilgpt2-finetuned-synopsis-genres_final into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Alirani/distilgpt2-finetuned-synopsis-genres_final into local empty directory.
Epoch 1/4
    6/40901 [..............................] - ETA: 1:20:22 - loss: 4.5849WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0315s vs `on_train_batch_end` time: 0.0793s). Check your callbacks.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0315s vs `on_train_batch_end` time: 0.0793s). Check your callbacks.
40901/40901 [==============================] - 5120s 125ms/step - loss: 4.0094 - val_loss: 3.8307
Epoch 2/4
40901/40901 [==============================] - 5655s 138ms/step - loss: 3.8810 - val_loss: 3.7803
Epoch 3/4
40901/40901 [==============================] - 6510s 159ms/step - loss: 3.8211 - val_loss: 3.7502
Epoch 4/4
40901/40901 [==============================] - 6037s 148ms/step - loss: 3.7786 - val_loss: 3.7310
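The cross-entropy losses logged above are easier to interpret as perplexities (the exponential of the loss). For the epoch-4 values:

```python
import math

# Perplexity = exp(cross-entropy loss), using the epoch-4 values logged above
train_ppl = math.exp(3.7786)
val_ppl = math.exp(3.7310)

print(f"train perplexity ≈ {train_ppl:.1f}")  # ≈ 43.8
print(f"val   perplexity ≈ {val_ppl:.1f}")    # ≈ 41.7
```

A validation perplexity around 42 is plausible for distilgpt2 fine-tuned on short synopses; the steadily decreasing val_loss across all four epochs suggests training had not yet overfit.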
Section 2.2.2 : Generation Poster ¶
In [ ]:
# Load SDXL, then swap in the LCM scheduler and LoRA weights for few-step inference
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

# LCM-LoRA works with very few steps and no classifier-free guidance
results = pipe(
    prompt="The poster of a movie called Blondie",
    num_inference_steps=4,
    guidance_scale=0.0,
)
results.images[0]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  5.45it/s]
The config attributes {'skip_prk_steps': True} were passed to LCMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
100%|██████████| 4/4 [14:32<00:00, 218.20s/it]
Out[ ]:
(generated poster image)
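For the app, the hard-coded prompt above would presumably be built from a title (and optionally genres). A small helper along those lines — the format string is an assumption for illustration, not taken from the app code:

```python
def build_poster_prompt(title, genres=None):
    """Build an SDXL prompt for a movie poster (hypothetical helper)."""
    prompt = f"The poster of a movie called {title}"
    if genres:
        # Fold genre names into the prompt, e.g. "comedy and drama genre"
        prompt += f", {' and '.join(genres)} genre"
    return prompt

build_poster_prompt("Blondie")
# → 'The poster of a movie called Blondie'
```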

Chapter 3 : App ¶

3.1.1 - Search

In [ ]:
from scripts.queries import search_query

results = search_query('movie', 'Batman', API_BASE_URL, API_KEY)
print("response : ", results.status_code, "\n output : ")
pd.DataFrame.from_dict(json.loads(results.text)['results']).head()
response :  200 
 output : 
Out[ ]:
adult backdrop_path genre_ids id original_language original_title overview popularity poster_path release_date title video vote_average vote_count
0 False /frDS8A5vIP927KYAxTVVKRIbqZw.jpg [14, 28, 80] 268 en Batman Batman must face his most ruthless nemesis whe... 41.401 /cij4dd21v2Rk2YtUQbV5kW69WB2.jpg 1989-06-21 Batman False 7.220 7243
1 False /bxxupqG6TBLKC60M6L8iOvbQEr6.jpg [28, 35, 80] 2661 en Batman The Dynamic Duo faces four super-villains who ... 20.018 /zzoPxWHnPa0eyfkMLgwbNvdEcVF.jpg 1966-07-30 Batman False 6.301 791
2 False /xEG5iP1qZCiDt4BefSpLy1d54zE.jpg [28, 12, 80, 878, 53, 10752] 125249 en Batman Japanese master spy Daka operates a covert esp... 10.790 /AvzD3mrtokIzZOiV6zAG7geIo6F.jpg 1943-07-16 Batman False 6.400 59
3 False /p2aiSLQZx7AVZrY9cfOOPv1u5Zk.jpg [27, 53, 878] 1160196 en Batman A young man learns the consequence of tempting... 3.991 /qIvTMHX2MIYG2Ij4jP5dkKgMqUo.jpg 2023-07-28 Batman False 6.000 3
4 False /tRS6jvPM9qPrrnx2KRp3ew96Yot.jpg [80, 9648, 53] 414906 en The Batman In his second year of fighting crime, Batman u... 136.362 /74xTEgt7R36Fpooo50r9T25onhq.jpg 2022-03-01 The Batman False 7.701 8833
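`search_query` lives in `scripts.queries` and is not reproduced in this notebook. A plausible sketch, assuming it wraps TMDB's v3 `/search/{type}` endpoint with `requests` — the URL layout matches TMDB v3, but the helper itself is an assumption:

```python
import requests

def build_search_url(media_type, query, api_base_url, api_key):
    # TMDB v3 search endpoint, e.g. /search/movie?query=Batman
    return f"{api_base_url}/search/{media_type}?api_key={api_key}&query={query}"

def search_query(media_type, query, api_base_url, api_key):
    """Hypothetical sketch: GET the search endpoint and return the raw response."""
    return requests.get(build_search_url(media_type, query, api_base_url, api_key))
```

Returning the raw `requests.Response` (rather than parsed JSON) matches how the notebook cells check `results.status_code` and parse `results.text` themselves.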

3.1.2 - Details

In [ ]:
from scripts.queries import get_details

result = get_details('movie', 299054, API_BASE_URL, API_KEY)
print("response : ", result.status_code, "\n output : ")
pd.DataFrame.from_dict(json.loads(result.text), orient='index').T
response :  200 
 output : 
Out[ ]:
adult backdrop_path belongs_to_collection budget genres homepage id imdb_id original_language original_title ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False /j9LX1sF7WSXmJlnhf0RGpWzEC0i.jpg {'id': 126125, 'name': 'The Expendables Collec... 100000000 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... https://expendables.movie/ 299054 tt3291150 en Expend4bles ... 2023-09-15 58000000 103 [{'english_name': 'English', 'iso_639_1': 'en'... Released They'll die when they're dead. Expend4bles False 6.428 824

1 rows × 25 columns
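The `genres` field in the details payload is a list of `{id, name}` dicts; to display them in the app they can be flattened to plain names. A small helper for this (not part of `scripts.queries`):

```python
def genre_names(details):
    """Extract genre names from a TMDB details payload."""
    return [g["name"] for g in details.get("genres", [])]

# Shape of the `genres` field as returned by /movie/{id}
details = {"genres": [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]}
genre_names(details)
# → ['Action', 'Adventure']
```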

3.1.3 - Trending

In [ ]:
from scripts.queries import get_trendings

result = get_trendings(1, 'day', API_BASE_URL, API_KEY)
print("response : ", result.status_code, "\n output : ", pd.DataFrame.from_dict(json.loads(result.text)['results']).shape)
pd.DataFrame.from_dict(json.loads(result.text)['results']).head()
response :  200 
 output :  (20, 15)
Out[ ]:
adult backdrop_path id name original_language original_name overview poster_path media_type genre_ids popularity first_air_date vote_average vote_count origin_country
0 False /wyLHV7oP0O88aVFFkS2Ue71Of6f.jpg 96648 Sweet Home ko 스위트홈 As humans turn into savage monsters and wreak ... /u8sLAJUvY9yzWqtVfKRQz5yin3D.jpg tv [18, 10765] 315.221 2020-12-18 8.400 1089 [KR]
1 False /jEDILaZtJOqNTEnFqWnYsCEVHpr.jpg 94244 Obliterated en Obliterated A special forces team thwarts a deadly plot in... /5g3UrcV6oguAcI3myMKb6wi28y5.jpg tv [35, 10759] 121.012 2023-11-30 7.917 12 [US]
2 False /vcFW09U4834DyFOeRZpsx9x1D3S.jpg 57243 Doctor Who en Doctor Who The Doctor is a Time Lord: a 900 year old alie... /4edFyasCrkH4MKs6H4mHqlrxA6b.jpg tv [10759, 18, 10765] 533.627 2005-03-26 7.460 2703 [GB]
3 False /oT81JufYbkP9BkFZm32VwvXRBOc.jpg 239770 Doctor Who en Doctor Who The Doctor and friends travel from the dawn of... /2I8aMfUvgRKQvEpBIQVKMbXgMsi.jpg tv [10759, 18, 10765] 111.138 7.620 25 [GB]
4 False /2bzS31ujJhUlKzXrU5nQ2OiV1G9.jpg 202411 Monarch: Legacy of Monsters en Monarch: Legacy of Monsters After surviving Godzilla's attack on San Franc... /uwrQHMnXD2DA1rvaMZk4pavZ3CY.jpg tv [18, 10765, 10759] 1436.206 2023-11-16 8.267 195 [US]

3.1.4 - Top-rated

In [ ]:
from scripts.queries import get_top_rated

results = get_top_rated("tv", 1, API_BASE_URL, API_KEY)
print("response : ", results.status_code, "\n output : ")
pd.DataFrame.from_dict(json.loads(results.text)['results']).head()
response :  200 
 output : 
Out[ ]:
adult backdrop_path genre_ids id origin_country original_language original_name overview popularity poster_path first_air_date name vote_average vote_count
0 False /9faGSFi5jam6pDWGNd0p8JcJgXQ.jpg [18, 80] 1396 [US] en Breaking Bad When Walter White, a New Mexico chemistry teac... 354.925 /3xnWaLQjelJDDF7LT1WBo6f4BRe.jpg 2008-01-20 Breaking Bad 8.900 12708
1 False /rkB4LyZHo1NHXFEDHl9vSD9r1lI.jpg [16, 18, 10765, 10759] 94605 [US] en Arcane Amid the stark discord of twin cities Piltover... 91.310 /fqldf2t8ztc9aiwn3k6mlX3tvRT.jpg 2021-11-06 Arcane 8.743 3450
2 False /a6ptrTUH1c5OdWanjyYtAkOuYD0.jpg [10759, 35, 16] 37854 [JP] ja ワンピース Years ago, the fearsome Pirate King, Gol D. Ro... 100.373 /e3NBGiAifW9Xt8xD5tpARskjccO.jpg 1999-10-20 One Piece 8.725 4187
3 False /rBF8wVQN8hTWHspVZBlI3h7HZJ.jpg [16, 35, 10765, 10759] 60625 [US] en Rick and Morty Rick is a mentally-unbalanced but scientifical... 924.938 /gdIrmf2DdY5mgN6ycVP0XlzKzbE.jpg 2013-12-02 Rick and Morty 8.700 8839
4 False /A6tMQAo6t6eRFCPhsrShmxZLqFB.jpg [10759, 16, 10765] 31911 [JP] ja 鋼の錬金術師 FULLMETAL ALCHEMIST Disregard for alchemy’s laws ripped half of Ed... 157.214 /5ZFUEOULaVml7pQuXxhpR2SmVUw.jpg 2009-04-05 Fullmetal Alchemist: Brotherhood 8.694 1804


Created by Alirani